Leveraging Giant Text Corpora to Enhance the Coverage of Pattern-based Information Extraction Systems
Not recorded
Abstract
Pattern-based approaches to Information Extraction (IE) typically apply a pattern learner to a set of domain-specific documents to generate the extraction patterns that make up the IE system. This limits the coverage of the system to the expressions and language constructs used within the training data. This research exploits the vast quantities of text readily available in large corpora, such as the Gigaword Corpus, to expand the coverage of existing pattern-based Information Extraction systems.
Similar resources
Estimating Relevance and Semantic Compatibility for IE Pattern Discovery in Large Text Corpora
Pattern-based approaches for Information Extraction (IE) typically apply a pattern learner to a set of domain-specific training documents to generate extraction patterns for the IE system. This restricts the coverage of the system primarily to the expressions and language constructs that appear within the limited training data. Our research looks to the vast quantities of readily available text...
Extracting a Parallel Corpus from Comparable Documents to Improve Translation Quality in Machine Translation Systems
Training data for statistical machine translation methods are usually prepared from three kinds of resources: parallel, non-parallel, and comparable text corpora. Parallel corpora are an ideal resource for translation, but because such texts are scarce, non-parallel and comparable corpora are also used for parallel text extraction. Most existing methods for exploiting comparable corpora loo...
Corpus based coreference resolution for Farsi text
"Coreference resolution", or "finding all expressions that refer to the same entity" in a text, is one of the important requirements in natural language processing. Two words are coreferent when both refer to a single entity in the text or the real world. So the main task of coreference resolution systems is to identify terms that refer to a unique entity. A coreference resolution tool could be...
Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents
Besides its own merits, text classification (TC) has become a cornerstone in many applications. The work presented here is part of, and a prerequisite for, a project we have undertaken to create a corpus for Arabic text processing. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...
Towards a Structured Representation of Generic Concepts and Relations in Large Text Corpora
Extraction of structured information from text corpora involves identifying entities and the relationships between entities expressed in unstructured text. We propose a novel iterative pattern induction method to extract relation tuples by exploiting the lexical and shallow syntactic patterns of a sentence. We start with a single pattern to illustrate how the method explores additional patterns and tuple...